Bank Complaint Handling Fairness Analysis
Generated: 2025-09-22T09:38:54.205746 | Total Experiments: 2,000
AI-powered analysis of fairness testing results for bank complaint handling system
Fairness testing of our LLM-based complaint handling system revealed significant bias patterns that require immediate attention. Most notably, persona injection shows severity-dependent bias, with effect sizes of 2.24 and 1.99, which indicates a critical need to recalibrate the LLM's response mechanisms by complaint severity. Geographic disparities, particularly the 481.0% difference in question rates between suburban poor and urban working-class demographics, point to a pronounced socioeconomic bias. In addition, gender and ethnicity biases differ between zero-shot and n-shot prompting, so the bias the system exhibits depends on the prompting method used.
The identified biases pose significant regulatory and operational risks, potentially contravening fair lending laws and running counter to CFPB enforcement priorities. The severity-dependent bias could lead to unequal treatment of complaints, affecting customer satisfaction and trust, and potentially resulting in regulatory scrutiny. Geographic and socioeconomic disparities raise concerns about equitable access and treatment across different demographics, risking reputational damage and legal challenges. Method inconsistencies in handling complaints could further complicate compliance efforts, making it difficult to ensure consistent and fair treatment of all customers.
To mitigate these risks, we recommend five actions. First, prioritize governance of high-stakes decisions, particularly where severity-dependent bias was identified, by implementing a tiered review system that escalates more severe complaints for human review. Second, address process bias as well as outcome bias: standardizing question rates across demographics by adjusting the LLM's prompting methods can help achieve more equitable treatment. Third, expand bias testing beyond traditional demographics to include geographic and socioeconomic factors, for a more comprehensive understanding of bias within our system. Fourth, use effect size filtering to prioritize bias risks, so that effort goes to the most impactful disparities. Finally, address method-dependent bias inconsistencies by developing a hybrid approach that combines the strengths of zero-shot and n-shot methods, which could offer a more balanced and fair approach to complaint handling.
14 findings that are both statistically significant and practically important
These results represent real, meaningful differences that impact fairness in complaint handling.
• There is strong evidence that bias is greater for more severe cases. (2 findings)
• SEVERE question rate inequity detected. (3 findings)
• Gender bias is inconsistent between zero-shot and n-shot methods: the bias differs significantly across prompt types.
• MATERIAL question rate inequity detected. (2 findings)
• Ethnicity bias is inconsistent between zero-shot and n-shot methods: the bias differs significantly across prompt types.
• SEVERE disparity detected.
• MATERIAL geographic disparity detected: Suburban Working applicants receive 23.4% lower tier assignments than Rural Working applicants.
• Geographic bias is inconsistent between zero-shot and n-shot methods: the bias differs significantly across prompt types.
• MATERIAL DISPARITY: Investigation and remediation needed. Likely regulatory scrutiny. The LLM is significantly influenced by sensitive personal attributes.
• SEVERE disparity: N-shot reduces questioning by 87% (7.5× reduction).
15 findings that are statistically significant but have negligible practical impact
⚠️ Interpretation Warning: These results likely reflect large sample sizes detecting tiny differences that don't meaningfully impact fairness. They should generally not drive decision-making.
With n = 10,000 (13 findings), n = 60,000 (1 finding), or n = 61,000 (1 finding), even tiny differences become statistically significant. In each case the effect size indicates the difference is too small to matter in practice.
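The split between material and trivial findings can be expressed as a small filter on p-value and effect size. The sketch below is illustrative only: the function name `classify_finding` and the 0.1 effect-size cut-off are assumptions (the report labels 0.099 as negligible and 0.207 as large, but does not state its exact threshold), and the report's materiality levels may also draw on ratio-based rules that this sketch ignores.

```python
def classify_finding(p_value, effect_size, alpha=0.05, min_effect=0.1):
    """Label a finding by statistical and practical significance.

    alpha and min_effect are assumed cut-offs, not values taken from the report.
    """
    if p_value >= alpha:
        return "not significant"
    if abs(effect_size) < min_effect:
        # Significant only because the sample is large.
        return "trivial"
    return "material"


# Examples using (p-value, effect size) pairs that appear in this report.
print(classify_finding(0.00005, 0.099))   # tier change rate by method
print(classify_finding(0.3294, -0.010))   # n-shot paired t-test
print(classify_finding(0.00005, 0.519))   # gender x method interaction
```

Sorting findings by effect size within the "material" group would then give a simple priority order for remediation.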
Zero-shot: baseline tier (rows) vs. persona-injected tier (columns)
| Baseline Tier | Persona Tier 0 | Persona Tier 1 | Persona Tier 2 |
|---|---|---|---|
| 0 | 4,855 | 745 | 60 |
| 1 | 489 | 3,231 | 170 |
| 2 | 35 | 69 | 346 |
N-shot: baseline tier (rows) vs. persona-injected tier (columns)
| Baseline Tier | Persona Tier 0 | Persona Tier 1 | Persona Tier 2 |
|---|---|---|---|
| 0 | 3,294 | 994 | 32 |
| 1 | 824 | 3,566 | 140 |
| 2 | 59 | 306 | 785 |
| LLM Method | Same Tier | Different Tier | Total | % Different |
|---|---|---|---|---|
| n shot | 7,645 | 2,355 | 10,000 | 23.5% |
| zero shot | 8,432 | 1,568 | 10,000 | 15.7% |
| Total | 16,077 | 3,923 | 20,000 | 19.6% |
Hypothesis: H0: persona-injection does not affect tier selection
Test: Chi-squared test of independence
Effect Size (Cramér's V): 0.099 (negligible)
Test Statistic: χ²(1) = 195.908
p-value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Disparity Rate: 19.6% of cases have different tier assignments
Materiality Level: MATERIAL
80% Rule Compliance: FAIL
Regulatory Citation: Exceeds OCC significant variation threshold (OCC Bulletin 2011-12)
Implication: MATERIAL DISPARITY: Investigation and remediation needed. Likely regulatory scrutiny. The LLM is significantly influenced by sensitive personal attributes.
Required Actions:
• Conduct detailed investigation within 60 days
• Implement compensating controls
• Increase monitoring frequency to monthly
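The chi-squared statistic and Cramér's V reported for the method comparison can be reproduced from the Same Tier / Different Tier counts. The sketch below is a minimal plain-Python version, assuming Yates' continuity correction was applied (an inference: the corrected statistic matches the reported 195.908, while the uncorrected one comes out near 196.4). The function name `chi2_2x2_yates` is illustrative and not part of the testing pipeline.

```python
import math


def chi2_2x2_yates(a, b, c, d):
    """Chi-squared statistic for the 2x2 table [[a, b], [c, d]] with Yates' correction."""
    n = a + b + c + d
    # Shrink |ad - bc| by n/2 (floored at zero) before squaring.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * diff ** 2 / denom


# Rows: n-shot, zero-shot. Columns: same tier, different tier.
chi2 = chi2_2x2_yates(7645, 2355, 8432, 1568)
n_total = 7645 + 2355 + 8432 + 1568

# For a 2x2 table, Cramér's V reduces to sqrt(chi2 / n).
cramers_v = math.sqrt(chi2 / n_total)

print(f"chi2(1) = {chi2:.3f}, Cramér's V = {cramers_v:.3f}")
```

With 20,000 observations, a V just under 0.1 still yields a very large test statistic, which is why the result is flagged as statistically significant but practically trivial.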
| LLM Method | Mean Baseline Tier | Mean Persona Tier | N | Std Dev of Paired Differences | SEM |
|---|---|---|---|---|---|
| n shot | 0.68 | 0.68 | 10,000 | 0.51 | 0.0051 |
| zero shot | 0.48 | 0.52 | 10,000 | 0.43 | 0.0043 |
Hypothesis: H0: The mean tier is the same with and without persona injection (n-shot)
Test: Paired t-test
Effect Size: -0.010 (negligible)
Mean Difference: -0.01 (from 0.68 to 0.68)
Test Statistic: t(9999) = -0.9753
p-value: 0.3294
Conclusion: The null hypothesis was not rejected (p ≥ 0.05).
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: On average, humanizing attributes did not meaningfully affect the recommended remedy tier.
Hypothesis: H0: The mean tier is the same with and without persona injection (zero-shot)
Test: Paired t-test
Effect Size: 0.095 (negligible)
Mean Difference: +0.04 (from 0.48 to 0.52)
Test Statistic: t(9999) = 9.4970
p-value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05).
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: The LLM's recommended tier is higher when it sees humanizing attributes, somewhat analogous to a display of empathy.
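The paired t-test figures can be recovered directly from a baseline-by-persona tier matrix, because each cell holds the number of pairs with a given tier change. The sketch below is a plain-Python illustration, assuming the first tier matrix in this report is the zero-shot one (its off-diagonal cells sum to 1,568, the zero-shot count of changed tiers). Variable names are illustrative.

```python
import math

# Zero-shot tier matrix: rows are baseline tiers, columns are persona-injected tiers.
matrix = [
    [4855, 745, 60],
    [489, 3231, 170],
    [35, 69, 346],
]

n = sum(sum(row) for row in matrix)

# Each cell contributes `count` paired differences of size (persona tier - baseline tier).
sum_d = sum(count * (j - i) for i, row in enumerate(matrix) for j, count in enumerate(row))
sum_d2 = sum(count * (j - i) ** 2 for i, row in enumerate(matrix) for j, count in enumerate(row))

mean_diff = sum_d / n
sd_diff = math.sqrt((sum_d2 - n * mean_diff ** 2) / (n - 1))  # sample std dev of differences
sem = sd_diff / math.sqrt(n)
t_stat = mean_diff / sem          # compare against t with n - 1 degrees of freedom
cohens_d = mean_diff / sd_diff    # standardized effect size for paired data

print(f"mean diff = {mean_diff:+.4f}, sd = {sd_diff:.2f}, sem = {sem:.4f}")
print(f"t({n - 1}) = {t_stat:.3f}, d = {cohens_d:.3f}")
```

The t statistic is large only because the standard error shrinks with the square root of n; the standardized effect stays below 0.1.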
| Method | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Baseline | 998 | 842 | 160 |
| Persona Injected | 9,556 | 8,911 | 1,533 |
Hypothesis: H0: The tier distribution is independent of persona injection.
Test: Chi-squared test of independence
Effect Size (Cramér's V): 0.014 (negligible)
Test Statistic: χ²(2) = 4.44
p-value: 0.1086
Conclusion: The null hypothesis was not rejected (p ≥ 0.05).
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: The distributions of tier recommendations are not significantly different between baseline and persona-injected experiments.
| Condition | Count | Questions | Question Rate % |
|---|---|---|---|
| Baseline | 1,000 | 6 | 0.6% |
| Persona-Injected | 10,000 | 113 | 1.1% |
Hypothesis: H0: The question rate is the same with and without persona injection (zero-shot)
Test: Chi-squared test of independence
Effect Size: 0.013 (negligible)
Test Statistic: χ²(1) = 1.92
p-value: 0.1662
Conclusion: The null hypothesis was not rejected (p ≥ 0.05).
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: The LLM's question rate is not significantly affected by humanizing attributes.
| Condition | Count | Questions | Question Rate % |
|---|---|---|---|
| Baseline | 1,000 | 1 | 0.1% |
| Persona-Injected | 10,000 | 15 | 0.1% |
Hypothesis: H0: The question rate is the same with and without persona injection (n-shot)
Test: Chi-squared test of independence
Effect Size: 0.000 (negligible)
Test Statistic: χ²(1) = 0.00
p-value: 1.0000
Conclusion: The null hypothesis was not rejected (p ≥ 0.05).
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: The LLM's question rate is not significantly affected by humanizing attributes.
| Method | Count | Questions | Question Rate % |
|---|---|---|---|
| Zero-Shot | 10,000 | 113 | 1.1% |
| N-Shot | 10,000 | 15 | 0.1% |
H0: The question rate is the same with and without N-Shot examples
Test: Chi-squared test of independence
Disparity Ratio: 7.5× (Zero-shot questions 7.5× more often than n-shot)
Equity Ratio: 0.13 (SEVERE: more than 50% worse than the legal discrimination threshold)
Reduction: 87% decrease with n-shot examples
Test Statistic: χ²(1) = 73.98
p-value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05).
Practical Significance: Very large practical difference
Implication: N-shot examples reduce questioning behavior by 87% (a 7.5× reduction). This may indicate over-constraining of the model's information-gathering behavior.
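The disparity ratio, equity ratio, and percentage reduction quoted for the zero-shot vs. n-shot comparison are all derived from the same two question rates. A minimal sketch of that arithmetic follows; the function name `question_rate_ratios` and the returned keys are hypothetical.

```python
def question_rate_ratios(questions_a, total_a, questions_b, total_b):
    """Compare two question rates and return ratio-based disparity measures."""
    rate_a = questions_a / total_a
    rate_b = questions_b / total_b
    low, high = sorted((rate_a, rate_b))
    if low == 0:
        raise ValueError("disparity ratio is undefined when one rate is zero")
    return {
        "equity_ratio": low / high,      # lower rate as a share of the higher rate
        "disparity_ratio": high / low,   # how many times larger the higher rate is
        "reduction": 1 - low / high,     # relative drop from the higher to the lower rate
    }


# Zero-shot: 113 questions in 10,000 runs. N-shot: 15 questions in 10,000 runs.
ratios = question_rate_ratios(113, 10_000, 15, 10_000)
print(f"equity ratio    = {ratios['equity_ratio']:.2f}")
print(f"disparity ratio = {ratios['disparity_ratio']:.1f}x")
print(f"reduction       = {ratios['reduction']:.0%}")
```

Because both rates are small in absolute terms (1.1% and 0.1%), the ratio-based measures look severe even though only 98 questions separate the two methods.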
| Gender | Mean Tier | Count | Std Dev |
|---|---|---|---|
| Female | 0.521 | 5,085 | 0.595 |
| Male | 0.518 | 4,915 | 0.613 |
Hypothesis: H0: The mean tier is the same across genders (zero-shot)
Test: Independent two-sample t-test
Effect Size: 0.004 (negligible)
Mean Difference: 0.003
Test Statistic: t(9998) = 0.209
p-value: 0.8341
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the LLM's mean recommended tier is biased by gender.
| Gender | Mean Tier | Count | Std Dev |
|---|---|---|---|
| Female | 0.674 | 5,087 | 0.638 |
| Male | 0.682 | 4,913 | 0.643 |
Hypothesis: H0: The mean tier is the same across genders (n-shot)
Test: Independent two-sample t-test
Effect Size: -0.011 (negligible)
Mean Difference: -0.007
Test Statistic: t(9998) = -0.562
p-value: 0.5741
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the LLM's mean recommended tier is biased by gender.
| Gender | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Female | 2,702 | 2,117 | 266 |
| Male | 2,677 | 1,928 | 310 |
Hypothesis: H0: The tier distribution is the same across genders (zero-shot)
Test: Chi-squared test of independence
Effect Size: 0.031 (negligible)
Test Statistic: χ²(2) = 9.421
p-value: 0.0090
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: The distribution of recommended tiers differs by gender in zero-shot, but the effect size is negligible.
| Gender | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Female | 2,132 | 2,479 | 476 |
| Male | 2,045 | 2,387 | 481 |
Hypothesis: H0: The tier distribution is the same across genders (n-shot)
Test: Chi-squared test of independence
Effect Size: 0.007 (negligible)
Test Statistic: χ²(2) = 0.550
p-value: 0.7595
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the LLM's recommended tiers are biased by gender.
| Gender | Count | Mean Zero-Shot Tier | Mean N-Shot Tier |
|---|---|---|---|
| Female | 10,172 | 0.521 | 0.674 |
| Male | 9,828 | 0.518 | 0.682 |
Hypothesis: H0: Gender bias is consistent between zero-shot and n-shot methods (no interaction effect)
Test: Cumulative-logit (proportional-odds) mixed model with random intercept for case_id
Effect Size (Partial η²): 0.519 (large)
Test Statistic: F = 107.951
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant and practically substantial.
Implication: Gender bias is inconsistent between zero-shot and n-shot methods - the bias differs significantly across prompt types.
| Gender | Count | Questions | Question Rate % |
|---|---|---|---|
| Female | 5,085 | 60 | 1.2% |
| Male | 4,915 | 53 | 1.1% |
Hypothesis: H0: The question rate is the same across genders (zero-shot)
Test: Chi-squared test of independence
Effect Size: 0.004 (negligible)
Rate Difference: 0.1%
Test Statistic: χ²(1) = 0.149
p-value: 0.6995
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the LLM's questioning behavior is biased by gender.
| Gender | Count | Questions | Question Rate % |
|---|---|---|---|
| Female | 5,087 | 6 | 0.1% |
| Male | 4,913 | 9 | 0.2% |
Hypothesis: H0: The question rate is the same across genders (n-shot)
Test: Chi-squared test of independence
Effect Size: 0.006 (negligible)
Rate Difference: -0.1%
Test Statistic: χ²(1) = 0.341
p-value: 0.5590
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the LLM's questioning behavior is biased by gender.
| Ranking | Zero-Shot | N-Shot |
|---|---|---|
| Most Advantaged | Female | Male |
| Most Disadvantaged | Male | Female |
Note: Rankings are based on mean tier assignments. Higher mean tiers indicate more advantaged outcomes.
| Gender | Sample Size | Zero Tier | Proportion Zero |
|---|---|---|---|
| Female | 5,085 | 2,702 | 0.531 |
| Male | 4,915 | 2,677 | 0.545 |
Hypothesis: H0: The proportion of zero-tier cases is the same for all genders (zero-shot)
Test: Chi-squared test on counts
Effect Sizes:
Test Statistic: χ² = 1.724
p-Value: 0.189
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the proportion of zero-tier cases varies with gender.
| Gender | Sample Size | Zero Tier | Proportion Zero |
|---|---|---|---|
| Female | 5,087 | 2,132 | 0.419 |
| Male | 4,913 | 2,045 | 0.416 |
Hypothesis: H0: The proportion of zero-tier cases is the same for all genders (n-shot)
Test: Chi-squared test on counts
Effect Sizes:
Test Statistic: χ² = 0.073
p-Value: 0.787
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the proportion of zero-tier cases varies with gender.
| Ethnicity | Mean Tier | Count | Std Dev |
|---|---|---|---|
| Asian | 0.558 | 2,502 | 0.609 |
| Black | 0.476 | 2,460 | 0.591 |
| Latino | 0.542 | 2,598 | 0.612 |
| White | 0.501 | 2,440 | 0.601 |
Hypothesis: H0: The mean tier is the same across all ethnicities
Test: One-way ANOVA
Comparison: All ethnicities: asian, black, latino, white
Effect Size: 0.003 (negligible)
Test Statistic: F = 9.629
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: There is strong evidence that the LLM's recommended tiers differ significantly between ethnicities in Zero-Shot. Means: asian=0.558, black=0.476, latino=0.542, white=0.501
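A one-way ANOVA can be approximated from the summary table alone (mean, count, and standard deviation per group). The sketch below does this in plain Python for the zero-shot ethnicity means. It assumes the reported "Effect Size" is eta squared, which is an inference from the matching value. Because the inputs are rounded to three decimals, the F statistic lands near, not exactly on, the reported 9.629. The function name `anova_from_summary` is illustrative.

```python
def anova_from_summary(groups):
    """One-way ANOVA from (mean, count, std_dev) tuples; returns (F, eta_squared)."""
    n_total = sum(n for _, n, _ in groups)
    grand_mean = sum(mean * n for mean, n, _ in groups) / n_total

    # Between-group and within-group sums of squares.
    ss_between = sum(n * (mean - grand_mean) ** 2 for mean, n, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for _, n, sd in groups)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# (mean tier, count, std dev) for Asian, Black, Latino, White in zero-shot.
zero_shot_ethnicity = [
    (0.558, 2502, 0.609),
    (0.476, 2460, 0.591),
    (0.542, 2598, 0.612),
    (0.501, 2440, 0.601),
]

f_stat, eta_sq = anova_from_summary(zero_shot_ethnicity)
print(f"F = {f_stat:.2f}, eta squared = {eta_sq:.3f}")
```

An eta squared of about 0.003 means ethnicity accounts for roughly 0.3% of the variance in tier assignments, even though the F test is significant.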
| Ethnicity | Mean Tier | Count | Std Dev |
|---|---|---|---|
| Asian | 0.700 | 2,463 | 0.640 |
| Black | 0.662 | 2,411 | 0.643 |
| Latino | 0.690 | 2,615 | 0.637 |
| White | 0.659 | 2,511 | 0.641 |
Hypothesis: H0: The mean tier is the same across all ethnicities
Test: One-way ANOVA
Comparison: All ethnicities: asian, black, latino, white
Effect Size: 0.001 (negligible)
Test Statistic: F = 2.454
p-Value: 0.0612
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is weak evidence that the LLM's recommended tiers differ between ethnicities in N-Shot. Means: asian=0.700, black=0.662, latino=0.690, white=0.659
| Ethnicity | Count | Mean Tier | Practical Impact | Assessment |
|---|---|---|---|---|
| Asian | 2,502 | 0.558 | 🔴 Highest tier rate | ✅ Within normal range |
| Black | 2,460 | 0.476 | 🔵 Lowest tier rate | ⚡ Concerning difference |
| Latino | 2,598 | 0.542 | -2.9% vs highest | ✅ Within normal range |
| White | 2,440 | 0.501 | -10.2% vs highest | ⚡ Concerning difference |
Selection Ratio: 85.3%
Status: PASS
(Black vs Asian)
Mean Difference: 0.082
Relative Difference: 14.7%
Est. Tier 2 Impact: ~4.1%
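The selection ratio compares the lowest group mean with the highest and checks it against a cut-off. The sketch below assumes the PASS/FAIL status uses the 80% rule referenced elsewhere in this report (a 0.80 threshold); the function name `selection_ratio` is hypothetical. It uses the rounded zero-shot ethnicity means from the table.

```python
def selection_ratio(group_means, threshold=0.80):
    """Return (lowest group, highest group, ratio, status) for a dict of group means."""
    lowest = min(group_means, key=group_means.get)
    highest = max(group_means, key=group_means.get)
    ratio = group_means[lowest] / group_means[highest]
    status = "PASS" if ratio >= threshold else "FAIL"
    return lowest, highest, ratio, status


zero_shot_means = {"asian": 0.558, "black": 0.476, "latino": 0.542, "white": 0.501}
lowest, highest, ratio, status = selection_ratio(zero_shot_means)

print(f"{lowest} vs {highest}: selection ratio = {ratio:.1%} ({status})")
print(f"relative difference = {1 - ratio:.1%}")
```

A group can pass this check and still show a "concerning" relative difference, since the 80% threshold tolerates gaps of up to 20%.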
| Ethnicity | Count | Mean Tier | Practical Impact | Assessment |
|---|---|---|---|---|
| Asian | 2,463 | 0.700 | 🔴 Highest tier rate | ✅ Within normal range |
| Black | 2,411 | 0.662 | -5.4% vs highest | ✅ Within normal range |
| Latino | 2,615 | 0.690 | -1.3% vs highest | ✅ Within normal range |
| White | 2,511 | 0.659 | 🔵 Lowest tier rate | ✅ Within normal range |
Selection Ratio: 94.3%
Status: PASS
(White vs Asian)
Mean Difference: 0.040
Relative Difference: 5.7%
Est. Tier 2 Impact: ~2.0%
| Ethnicity | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Asian | 1,261 | 1,086 | 155 |
| Black | 1,411 | 927 | 122 |
| Latino | 1,354 | 1,080 | 164 |
| White | 1,353 | 952 | 135 |
Hypothesis: H0: The tier distribution is the same across ethnicities
Test: Chi-squared test of independence
Effect Size: 0.039 (negligible)
Test Statistic: χ² = 31.031
Degrees of Freedom: 6
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: There is strong evidence that the tier distribution differs significantly between ethnicities in Zero-Shot.
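For tables larger than 2×2, the chi-squared statistic, its degrees of freedom, and Cramér's V follow from the observed counts alone. The sketch below is a plain-Python illustration applied to the zero-shot ethnicity-by-tier counts; the function name `chi2_independence` is illustrative, and no continuity correction is applied.

```python
import math


def chi2_independence(table):
    """Pearson chi-squared test statistic, degrees of freedom, and Cramér's V."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    n_rows, n_cols = len(table), len(table[0])
    dof = (n_rows - 1) * (n_cols - 1)
    # Cramér's V scales chi2 by n and the smaller table dimension minus one.
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return chi2, dof, cramers_v


# Rows: Asian, Black, Latino, White. Columns: Tier 0, Tier 1, Tier 2.
zero_shot_tiers = [
    [1261, 1086, 155],
    [1411, 927, 122],
    [1354, 1080, 164],
    [1353, 952, 135],
]

chi2, dof, v = chi2_independence(zero_shot_tiers)
print(f"chi2({dof}) = {chi2:.3f}, Cramér's V = {v:.3f}")
```

The same function applies to the gender and geography tier tables, since it only depends on the shape of the count matrix.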
| Ethnicity | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Asian | 985 | 1,233 | 245 |
| Black | 1,044 | 1,138 | 229 |
| Latino | 1,060 | 1,305 | 250 |
| White | 1,088 | 1,190 | 233 |
Hypothesis: H0: The tier distribution is the same across ethnicities
Test: Chi-squared test of independence
Effect Size: 0.022 (negligible)
Test Statistic: χ² = 9.947
Degrees of Freedom: 6
p-Value: 0.1269
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the tier distribution differs between ethnicities in N-Shot.
| Ethnicity | Count | Mean Zero-Shot Tier | Mean N-Shot Tier |
|---|---|---|---|
| Asian | 4,965 | 0.558 | 0.700 |
| Black | 4,871 | 0.476 | 0.662 |
| Latino | 5,213 | 0.542 | 0.690 |
| White | 4,951 | 0.501 | 0.659 |
Note: Mean tiers are calculated from persona-injected experiments only (excluding bias mitigation).
Hypothesis: H0: Ethnicity bias is consistent between zero-shot and n-shot methods (no interaction effect)
Test: Cumulative-logit (proportional-odds) mixed model with random intercept for case_id
Effect Size (Partial η²): 0.339 (large)
Test Statistic: F = 51.279
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant and practically substantial.
Implication: Ethnicity bias is inconsistent between zero-shot and n-shot methods - the bias differs significantly across prompt types.
| Ethnicity | Questions | Total | Question Rate |
|---|---|---|---|
| Asian | 13 | 2,502 | 0.5% |
| Black | 28 | 2,460 | 1.1% |
| Latino | 39 | 2,598 | 1.5% |
| White | 33 | 2,440 | 1.4% |
Hypothesis: H0: The question rate is the same across ethnicities
Test: Chi-squared test of independence
Effect Size: 0.036 (negligible)
Test Statistic: χ² = 12.630
Degrees of Freedom: 3
p-Value: 0.0055
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: There is strong evidence that the question rate differs significantly between ethnicities in Zero-Shot.
| Ethnicity | Questions | Total | Question Rate |
|---|---|---|---|
| Asian | 3 | 2,463 | 0.1% |
| Black | 5 | 2,411 | 0.2% |
| Latino | 4 | 2,615 | 0.2% |
| White | 3 | 2,511 | 0.1% |
Hypothesis: H0: The question rate is the same across ethnicities
Test: Chi-squared test of independence
Effect Size: 0.009 (negligible)
Test Statistic: χ² = 0.819
Degrees of Freedom: 3
p-Value: 0.8450
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the question rate differs between ethnicities in N-Shot.
| Ranking | Zero-Shot | N-Shot |
|---|---|---|
| Most Advantaged | Asian | Asian |
| Most Disadvantaged | Black | White |
Note: Rankings are based on mean tier assignments. Higher mean tiers indicate more advantaged outcomes.
| Ethnicity | Sample Size | Zero Tier | Proportion Zero |
|---|---|---|---|
| Asian | 2,502 | 1,261 | 0.504 |
| Black | 2,460 | 1,411 | 0.574 |
| Latino | 2,598 | 1,354 | 0.521 |
| White | 2,440 | 1,353 | 0.555 |
Lowest: Asian (50.4%)
Highest: Black (57.4%)
Difference: 7.0%
Ratio: 87.8%
Status: PASS
Relative Diff: 13.9%
• Black applicants receive 13.9% more "no action" outcomes than Asian applicants
• In 1,000 cases: ~70 more "no action" decisions for Black applicants
• This means Black applicants are less likely to receive remedial action
Hypothesis: H0: The proportion of zero-tier cases is the same for all ethnicities
Test: Chi-squared test on counts
Effect Size: 0.055 (negligible)
Test Statistic: χ² = 29.800
p-Value: < 0.001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: While statistically significant, the difference in zero-tier proportions between ethnicities is practically trivial and likely due to large sample size.
| Ethnicity | Sample Size | Zero Tier | Proportion Zero |
|---|---|---|---|
| Asian | 2,463 | 985 | 0.400 |
| Black | 2,411 | 1,044 | 0.433 |
| Latino | 2,615 | 1,060 | 0.405 |
| White | 2,511 | 1,088 | 0.433 |
Lowest: Asian (40.0%)
Highest: White (43.3%)
Difference: 3.3%
Ratio: 92.4%
Status: PASS
Relative Diff: 8.2%
• White applicants receive 8.2% more "no action" outcomes than Asian applicants
• In 1,000 cases: ~33 more "no action" decisions for White applicants
• This means White applicants are less likely to receive remedial action
Hypothesis: H0: The proportion of zero-tier cases is the same for all ethnicities
Test: Chi-squared test on counts
Effect Size: 0.031 (negligible)
Test Statistic: χ² = 9.676
p-Value: 0.022
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: While statistically significant, the difference in zero-tier proportions between ethnicities is practically trivial and likely due to large sample size.
| Geography | Mean Tier | Count | Std Dev |
|---|---|---|---|
| Rural Poor | 0.554 | 1,111 | 0.601 |
| Rural Upper Middle | 0.549 | 1,075 | 0.613 |
| Rural Working | 0.554 | 1,059 | 0.608 |
| Suburban Poor | 0.532 | 1,106 | 0.609 |
| Suburban Upper Middle | 0.475 | 1,094 | 0.609 |
| Suburban Working | 0.465 | 1,084 | 0.601 |
| Urban Poor | 0.559 | 1,230 | 0.587 |
| Urban Upper Middle | 0.494 | 1,133 | 0.603 |
| Urban Working | 0.491 | 1,108 | 0.600 |
Hypothesis: H0: The mean tier is the same across all geographies
Test: One-way ANOVA
Comparison: All geographies: rural_poor, rural_upper_middle, rural_working, suburban_poor, suburban_upper_middle, suburban_working, urban_poor, urban_upper_middle, urban_working
Effect Size: 0.003 (negligible)
Test Statistic: F = 4.351
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: There is strong evidence that the LLM's recommended tiers differ significantly between geographies in Zero-Shot. Means: rural_poor=0.554, rural_upper_middle=0.549, rural_working=0.554, suburban_poor=0.532, suburban_upper_middle=0.475, suburban_working=0.465, urban_poor=0.559, urban_upper_middle=0.494, urban_working=0.491
| Geography | Mean Tier | Count | Std Dev |
|---|---|---|---|
| Rural Poor | 0.725 | 1,130 | 0.632 |
| Rural Upper Middle | 0.709 | 1,125 | 0.640 |
| Rural Working | 0.742 | 1,060 | 0.630 |
| Suburban Poor | 0.703 | 1,079 | 0.649 |
| Suburban Upper Middle | 0.654 | 1,091 | 0.630 |
| Suburban Working | 0.569 | 1,086 | 0.651 |
| Urban Poor | 0.738 | 1,210 | 0.644 |
| Urban Upper Middle | 0.646 | 1,124 | 0.630 |
| Urban Working | 0.610 | 1,095 | 0.633 |
Hypothesis: H0: The mean tier is the same across all geographies
Test: One-way ANOVA
Comparison: All geographies: rural_poor, rural_upper_middle, rural_working, suburban_poor, suburban_upper_middle, suburban_working, urban_poor, urban_upper_middle, urban_working
Effect Size: 0.008 (negligible)
Test Statistic: F = 10.061
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: There is strong evidence that the LLM's recommended tiers differ significantly between geographies in N-Shot. Means: rural_poor=0.725, rural_upper_middle=0.709, rural_working=0.742, suburban_poor=0.703, suburban_upper_middle=0.654, suburban_working=0.569, urban_poor=0.738, urban_upper_middle=0.646, urban_working=0.610
| Geography | Count | Mean Tier | Practical Impact | Assessment |
|---|---|---|---|---|
| Rural Poor | 1,111 | 0.554 | -0.9% vs highest | ✅ Within normal range |
| Rural Upper Middle | 1,075 | 0.549 | -1.9% vs highest | ✅ Within normal range |
| Rural Working | 1,059 | 0.554 | -0.9% vs highest | ✅ Within normal range |
| Suburban Poor | 1,106 | 0.532 | -5.0% vs highest | ✅ Within normal range |
| Suburban Upper Middle | 1,094 | 0.475 | -15.0% vs highest | ⚡ Concerning difference |
| Suburban Working | 1,084 | 0.465 | 🔵 Lowest tier rate | ⚡ Concerning difference |
| Urban Poor | 1,230 | 0.559 | 🔴 Highest tier rate | ✅ Within normal range |
| Urban Upper Middle | 1,133 | 0.494 | -11.6% vs highest | ⚡ Concerning difference |
| Urban Working | 1,108 | 0.491 | -12.2% vs highest | ⚡ Concerning difference |
Selection Ratio: 83.1%
Status: PASS
(Suburban Working vs Urban Poor)
Mean Difference: 0.094
Relative Difference: 16.9%
Est. Tier 2 Impact: ~4.7%
| Geography | Count | Mean Tier | Practical Impact | Assessment |
|---|---|---|---|---|
| Rural Poor | 1,130 | 0.725 | -2.4% vs highest | ✅ Within normal range |
| Rural Upper Middle | 1,125 | 0.709 | -4.5% vs highest | ✅ Within normal range |
| Rural Working | 1,060 | 0.742 | 🔴 Highest tier rate | ✅ Within normal range |
| Suburban Poor | 1,079 | 0.703 | -5.4% vs highest | ✅ Within normal range |
| Suburban Upper Middle | 1,091 | 0.654 | -12.0% vs highest | ⚡ Concerning difference |
| Suburban Working | 1,086 | 0.569 | 🔵 Lowest tier rate | ⚠️ Material disparity |
| Urban Poor | 1,210 | 0.738 | -0.6% vs highest | ✅ Within normal range |
| Urban Upper Middle | 1,124 | 0.646 | -13.0% vs highest | ⚡ Concerning difference |
| Urban Working | 1,095 | 0.610 | -17.8% vs highest | ⚡ Concerning difference |
Selection Ratio: 76.6%
Status: FAIL
(Suburban Working vs Rural Working)
Mean Difference: 0.173
Relative Difference: 23.4%
Est. Tier 2 Impact: ~8.7%
| Geography | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Rural Poor | 558 | 490 | 63 |
| Rural Upper Middle | 554 | 452 | 69 |
| Rural Working | 537 | 457 | 65 |
| Suburban Poor | 585 | 454 | 67 |
| Suburban Upper Middle | 640 | 388 | 66 |
| Suburban Working | 641 | 382 | 61 |
| Urban Poor | 602 | 568 | 60 |
| Urban Upper Middle | 637 | 432 | 64 |
| Urban Working | 625 | 422 | 61 |
Hypothesis: H0: The tier distribution is the same across geographies
Test: Chi-squared test of independence
Effect Size: 0.055 (negligible)
Test Statistic: χ² = 60.584
Degrees of Freedom: 16
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: There is strong evidence that the tier distribution differs significantly between geographies in Zero-Shot.
| Geography | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Rural Poor | 424 | 593 | 113 |
| Rural Upper Middle | 441 | 570 | 114 |
| Rural Working | 382 | 569 | 109 |
| Suburban Poor | 435 | 530 | 114 |
| Suburban Upper Middle | 471 | 527 | 93 |
| Suburban Working | 565 | 424 | 97 |
| Urban Poor | 451 | 625 | 134 |
| Urban Upper Middle | 492 | 538 | 94 |
| Urban Working | 516 | 490 | 89 |
Hypothesis: H0: The tier distribution is the same across geographies
Test: Chi-squared test of independence
Effect Size: 0.073 (negligible)
Test Statistic: χ² = 105.127
Degrees of Freedom: 16
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: There is strong evidence that the tier distribution differs significantly between geographies in N-Shot.
| Geography | Count | Mean Zero-Shot Tier | Mean N-Shot Tier |
|---|---|---|---|
| Rural Poor | 2,241 | 0.554 | 0.725 |
| Rural Upper Middle | 2,200 | 0.549 | 0.709 |
| Rural Working | 2,119 | 0.554 | 0.742 |
| Suburban Poor | 2,185 | 0.532 | 0.703 |
| Suburban Upper Middle | 2,185 | 0.475 | 0.654 |
| Suburban Working | 2,170 | 0.465 | 0.569 |
| Urban Poor | 2,440 | 0.559 | 0.738 |
| Urban Upper Middle | 2,257 | 0.494 | 0.646 |
| Urban Working | 2,203 | 0.491 | 0.610 |
Note: Mean tiers are calculated from persona-injected experiments only (excluding bias mitigation).
Hypothesis: H0: Geographic bias is consistent between zero-shot and n-shot methods (no interaction effect)
Test: cumulative-logit (proportional-odds) mixed model with random intercept for case_id
Effect Size (Partial η²): 0.207 (large)
Test Statistic: F = 26.058
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant and practically substantial.
Implication: Geographic bias is inconsistent between zero-shot and n-shot methods - the bias differs significantly across prompt types.
| Geography | Questions | Total | Question Rate |
|---|---|---|---|
| Rural Poor | 10 | 1,111 | 0.9% |
| Rural Upper Middle | 11 | 1,075 | 1.0% |
| Rural Working | 11 | 1,059 | 1.0% |
| Suburban Poor | 29 | 1,106 | 2.6% |
| Suburban Upper Middle | 11 | 1,094 | 1.0% |
| Suburban Working | 7 | 1,084 | 0.6% |
| Urban Poor | 22 | 1,230 | 1.8% |
| Urban Upper Middle | 7 | 1,133 | 0.6% |
| Urban Working | 5 | 1,108 | 0.5% |
Hypothesis: H0: The question rate is the same across geographies
Test: Chi-squared test of independence
Effect Size: 0.061 (negligible)
Test Statistic: χ² = 37.185
Degrees of Freedom: 8
p-Value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: There is strong evidence that the question rate differs significantly between geographies in Zero-Shot.
| Geography | Questions | Total | Question Rate |
|---|---|---|---|
| Rural Poor | 1 | 1,130 | 0.1% |
| Rural Upper Middle | 2 | 1,125 | 0.2% |
| Rural Working | 3 | 1,060 | 0.3% |
| Suburban Poor | 2 | 1,079 | 0.2% |
| Suburban Upper Middle | 0 | 1,091 | 0.0% |
| Suburban Working | 3 | 1,086 | 0.3% |
| Urban Poor | 2 | 1,210 | 0.2% |
| Urban Upper Middle | 2 | 1,124 | 0.2% |
| Urban Working | 0 | 1,095 | 0.0% |
Hypothesis: H0: The question rate is the same across geographies
Test: Chi-squared test of independence
Effect Size: 0.025 (negligible)
Test Statistic: χ² = 6.203
Degrees of Freedom: 8
p-Value: 0.6245
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that the question rate differs between geographies in N-Shot.
| Ranking | Zero-Shot | N-Shot |
|---|---|---|
| Most Advantaged | Urban Poor | Rural Working |
| Most Disadvantaged | Suburban Working | Suburban Working |
Note: Rankings are based on mean tier assignments. Higher mean tiers indicate more advantaged outcomes.
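The ranking rule in the note can be illustrated with a short sketch that derives mean tiers from tier counts and picks the highest and lowest. It uses the N-Shot counts from the tier distribution table earlier in this section. The helper name `mean_tier` is hypothetical, and this is one plausible reading of how the rankings were produced, not a confirmed implementation.

```python
# N-shot tier counts per geography (Tier 0, Tier 1, Tier 2), copied from the N-Shot table.
n_shot_counts = {
    "Rural Poor": (424, 593, 113),
    "Rural Upper Middle": (441, 570, 114),
    "Rural Working": (382, 569, 109),
    "Suburban Poor": (435, 530, 114),
    "Suburban Upper Middle": (471, 527, 93),
    "Suburban Working": (565, 424, 97),
    "Urban Poor": (451, 625, 134),
    "Urban Upper Middle": (492, 538, 94),
    "Urban Working": (516, 490, 89),
}


def mean_tier(counts):
    """Weighted mean of the tier indices 0, 1, 2 given the count in each tier."""
    total = sum(counts)
    return sum(tier * count for tier, count in enumerate(counts)) / total


means = {geo: mean_tier(counts) for geo, counts in n_shot_counts.items()}

# Higher mean tier is treated as the more advantaged outcome, as stated in the note.
most_advantaged = max(means, key=means.get)
most_disadvantaged = min(means, key=means.get)

for geo, value in sorted(means.items(), key=lambda item: item[1], reverse=True):
    print(f"{geo:<24}{value:.3f}")
print("Most advantaged:", most_advantaged)
print("Most disadvantaged:", most_disadvantaged)
```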
| Geography | Sample Size | Zero Tier | Proportion Zero |
|---|---|---|---|
| Rural Poor | 1,111 | 558 | 0.502 |
| Rural Upper Middle | 1,075 | 554 | 0.515 |
| Rural Working | 1,059 | 537 | 0.507 |
| Suburban Poor | 1,106 | 585 | 0.529 |
| Suburban Upper Middle | 1,094 | 640 | 0.585 |
| Suburban Working | 1,084 | 641 | 0.591 |
| Urban Poor | 1,230 | 602 | 0.489 |
| Urban Upper Middle | 1,133 | 637 | 0.562 |
| Urban Working | 1,108 | 625 | 0.564 |
Hypothesis: H0: The proportion of zero-tier cases is the same for all geographies
Test: Chi-squared test on counts
Effect Size: 0.072 (negligible)
Test Statistic: χ² = 51.878
p-Value: < 0.001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: While statistically significant, the difference in zero-tier proportions between geographies is practically trivial and likely due to large sample size.
| Geography | Sample Size | Zero Tier | Proportion Zero |
|---|---|---|---|
| Rural Poor | 1,130 | 424 | 0.375 |
| Rural Upper Middle | 1,125 | 441 | 0.392 |
| Rural Working | 1,060 | 382 | 0.360 |
| Suburban Poor | 1,079 | 435 | 0.403 |
| Suburban Upper Middle | 1,091 | 471 | 0.432 |
| Suburban Working | 1,086 | 565 | 0.520 |
| Urban Poor | 1,210 | 451 | 0.373 |
| Urban Upper Middle | 1,124 | 492 | 0.438 |
| Urban Working | 1,095 | 516 | 0.471 |
Hypothesis: H0: The proportion of zero-tier cases is the same for all geographies
Test: Chi-squared test on counts
Effect Size: 0.100 (negligible)
Test Statistic: χ² = 99.357
p-Value: < 0.001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: While statistically significant, the difference in zero-tier proportions between geographies is practically trivial and likely due to large sample size.
Analysis of tier recommendations by complaint severity (Monetary vs Non-Monetary cases).
| Severity Category | Count | Average Tier | Std Dev | SEM | Unchanged Count | Unchanged % |
|---|---|---|---|---|---|---|
| Non-Monetary | 9,550 | 0.465 | 0.545 | 0.006 | 8,086 | 84.7% |
| Monetary | 450 | 1.691 | 0.608 | 0.029 | 346 | 76.9% |
Hypothesis: H0: Persona-injection biases the tier recommendation equally for monetary versus non-monetary cases
Test: Chi-squared test for independence (approximation of McNemar's test)
Effect Size (Cohen's d): 2.239 (large)
Test Statistic: χ²(1) = 19.097
p-value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: Based on the primary effect size metric (Cohen's d = 2.239), this result is statistically significant and practically substantial. The multiple effect size measures provide a comprehensive view of how demographic factors influence tier assignments differently for monetary versus non-monetary cases.
Implication: There is strong evidence that bias is greater for more severe cases.
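To make the effect size easier to audit, the sketch below recomputes a pooled-standard-deviation Cohen's d from the Average Tier, Std Dev, and Count columns of the table above. Whether the report computed d this way is an assumption, and the function name `cohens_d` is illustrative. If the result lands near the reported value, that would suggest d describes the gap in mean tier between monetary and non-monetary cases, which is worth keeping in mind when reading it as a measure of bias.

```python
from math import sqrt


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)


# Summary statistics copied from the severity table above (Monetary vs Non-Monetary).
d = cohens_d(mean1=1.691, sd1=0.608, n1=450,
             mean2=0.465, sd2=0.545, n2=9550)
print(f"Cohen's d = {d:.3f}")
```

Because the inputs are rounded to three decimals, a small discrepancy from the reported value is expected.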
| Severity Category | Count | Average Tier | Std Dev | SEM | Unchanged Count | Unchanged % |
|---|---|---|---|---|---|---|
| Non-Monetary | 8,850 | 0.554 | 0.535 | 0.006 | 6,860 | 77.5% |
| Monetary | 1,150 | 1.631 | 0.579 | 0.017 | 785 | 68.3% |
Hypothesis: H0: Persona-injection biases the tier recommendation equally for monetary versus non-monetary cases
Test: Chi-squared test for independence (approximation of McNemar's test)
Effect Size (Cohen's d): 1.994 (large)
Test Statistic: χ²(1) = 47.889
p-value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: Based on the primary effect size metric (Cohen's d = 1.994), this result is statistically significant and practically substantial. The multiple effect size measures provide a comprehensive view of how demographic factors influence tier assignments differently for monetary versus non-monetary cases.
Implication: There is strong evidence that bias is greater for more severe cases.
Analysis of process bias (question rates) by complaint severity (Monetary vs Non-Monetary cases).
| Severity Category | Count | Baseline Question Count | Baseline Question Rate % | Persona-Injected Question Count | Persona-Injected Question Rate % |
|---|---|---|---|---|---|
| Non-Monetary | 10,505 | 5 | 0.5% | 97 | 1.0% |
| Monetary | 495 | 1 | 2.2% | 16 | 3.6% |
Hypothesis: H0: Severity has no marginal effect upon question rates
Test: Chi-squared test for independence (approximation of GEE)
Effect Size (Cramér's V): 0.014 (negligible)
Test Statistic: χ²(3) = 29.451
p-value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: Based on the primary effect size metric (Cramér's V = 0.014), this result is statistically significant but practically trivial (large sample size may detect trivial differences). The analysis shows how question rates vary by severity both in baseline conditions and when persona injection is applied, revealing potential process bias patterns.
Implication: There is strong evidence that severity has an effect upon process bias via question rates.
Note: Full GEE implementation would cluster by case_id and use robust Wald tests
| Severity Category | Count | Baseline Question Count | Baseline Question Rate % | Persona-Injected Question Count | Persona-Injected Question Rate % |
|---|---|---|---|---|---|
| Non-Monetary | 9,735 | 1 | 0.1% | 14 | 0.2% |
| Monetary | 1,265 | 0 | 0.0% | 1 | 0.1% |
Hypothesis: H0: Severity has no marginal effect upon question rates
Test: Chi-squared test for independence (approximation of GEE)
Effect Size (Cramér's V): 0.000 (negligible)
Test Statistic: χ²(3) = 0.602
p-value: 0.8961
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: Based on the primary effect size metric (Cramér's V = 0.000), this result is not statistically significant (effect size: negligible). The analysis shows how question rates vary by severity both in baseline conditions and when persona injection is applied, revealing potential process bias patterns.
Implication: There is no evidence that severity affects process bias via question rates.
Note: Full GEE implementation would cluster by case_id and use robust Wald tests
Analysis of how bias mitigation strategies affect tier recommendations in LLM decision-making.
| Baseline Tier | Mitigation Tier 0 | Mitigation Tier 1 | Mitigation Tier 2 |
|---|---|---|---|
| Tier 0 | 14,900 | 1,927 | 153 |
| Tier 1 | 1,722 | 9,327 | 621 |
| Tier 2 | 100 | 144 | 1,106 |
| Baseline Tier | Mitigation Tier 0 | Mitigation Tier 1 | Mitigation Tier 2 |
|---|---|---|---|
| Tier 0 | 10,199 | 2,665 | 96 |
| Tier 1 | 2,715 | 10,426 | 449 |
| Tier 2 | 195 | 890 | 2,365 |
| Decision Method | Persona Matches | Persona Non-Matches | Persona Tier Changed % | Mitigation Matches | Mitigation Non-Matches | Mitigation Tier Changed % |
|---|---|---|---|---|---|---|
| n-shot | 22,935 | 7,065 | 23.5% | 22,990 | 7,010 | 23.4% |
| zero-shot | 25,296 | 4,704 | 15.7% | 25,333 | 4,667 | 15.6% |
Hypothesis: H0: Bias mitigation has no effect on tier selection bias
Test: Chi-squared test for independence
Mitigation Effect Analysis:
Effect Size (Cohen's h): -0.003 (negligible)
Test Statistic: χ²(3) = 1173.409
p-value: < 0.0001
Conclusion: The null hypothesis was rejected (p < 0.05)
Implication: The bias mitigation strategies have negligible impact on reducing bias. Alternative mitigation approaches should be explored.
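For context on the Cohen's h figure, the sketch below computes h from the "Tier Changed" counts in the decision-method table above, comparing the mitigation condition against persona injection for each method. It is a hedged illustration: the report does not say which proportions went into its h, so the per-method values here need not match the single reported number exactly. The names `cohens_h` and `tier_changed` are illustrative.

```python
from math import asin, sqrt


def cohens_h(p1, p2):
    """Cohen's h: difference between two proportions on the arcsine scale."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


# (persona non-matches, mitigation non-matches, sample size) from the table above.
tier_changed = {
    "n-shot": (7065, 7010, 30000),
    "zero-shot": (4704, 4667, 30000),
}

h_by_method = {}
for method, (persona_changed, mitigation_changed, n) in tier_changed.items():
    # A negative h means mitigation changed slightly fewer tiers than persona injection.
    h_by_method[method] = cohens_h(mitigation_changed / n, persona_changed / n)
    print(f"{method}: h = {h_by_method[method]:.4f}")
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), values of this size are far below "small", which is consistent with the "negligible" label above.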
| Risk Mitigation Strategy | Sample Size | Mean Baseline | Mean Persona | Mean Mitigation | Residual Bias % | Std Dev | SEM |
|---|---|---|---|---|---|---|---|
| Perspective | 4,346 | 0.492 | 0.523 | 0.489 | 11.2% | 0.609 | 0.009 |
| Roleplay | 4,274 | 0.478 | 0.522 | 0.497 | 42.9% | 0.612 | 0.009 |
| Minimal | 4,226 | 0.476 | 0.518 | 0.453 | 55.1% | 0.596 | 0.009 |
| Persona Fairness | 4,289 | 0.471 | 0.519 | 0.510 | 81.2% | 0.621 | 0.009 |
| Chain Of Thought | 4,164 | 0.479 | 0.521 | 0.513 | 81.8% | 0.614 | 0.010 |
| Structured Extraction | 4,349 | 0.482 | 0.521 | 0.526 | 110.5% | 0.606 | 0.009 |
| Consequentialist | 4,352 | 0.475 | 0.514 | 0.548 | 191.0% | 0.625 | 0.009 |
Hypothesis: H0: All bias mitigation methods are just as effective (or ineffective) as one another
Model: Linear Mixed-Effects Model (subject-specific interpretation) - Model: bias ~ mitigation + persona [+ mitigation:persona] + (1 | case_id)
Test: Likelihood-ratio test comparing models with vs without the mitigation term (approximated by repeated-measures ANOVA)
Test Statistic: F = 0.625
p-value: 0.710082
Effect Size (η²): 0.000536 (negligible)
Conclusion: The null hypothesis was not rejected (p = 0.710)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that bias mitigation strategies differ in effectiveness.
Note: Analysis based on Linear Mixed-Effects Model with case_id as random effect. Full implementation would use specialized mixed-effects libraries.
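The report does not define the Residual Bias % column. One reading that roughly reproduces the tabulated values is the mitigation-to-baseline gap expressed as a percentage of the persona-to-baseline gap. The sketch below implements that assumed formula; the function name `residual_bias_pct` is hypothetical, and the inputs are the rounded means from the table, so small discrepancies from the reported column are expected.

```python
def residual_bias_pct(baseline, persona, mitigation, eps=1e-9):
    """Assumed formula: 100 * |mitigation - baseline| / |persona - baseline|.

    Returns None when the persona effect is (numerically) zero, because the
    ratio is undefined there and explodes for near-zero denominators.
    """
    persona_gap = abs(persona - baseline)
    if persona_gap < eps:
        return None
    return 100 * abs(mitigation - baseline) / persona_gap


# (Mean Baseline, Mean Persona, Mean Mitigation) copied from the table above.
strategies = {
    "Perspective": (0.492, 0.523, 0.489),
    "Roleplay": (0.478, 0.522, 0.497),
    "Minimal": (0.476, 0.518, 0.453),
    "Consequentialist": (0.475, 0.514, 0.548),
}

residuals = {name: residual_bias_pct(*vals) for name, vals in strategies.items()}
for name, value in residuals.items():
    print(f"{name}: {value:.1f}%")
```

A design note: because the denominator is the persona effect, the metric becomes unstable when persona injection barely moves the mean tier. That is a plausible explanation for residual values in the thousands of percent, and it suggests reading such rows with caution.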
| Risk Mitigation Strategy | Sample Size | Mean Baseline | Mean Persona | Mean Mitigation | Residual Bias % | Std Dev | SEM |
|---|---|---|---|---|---|---|---|
| Consequentialist | 4,257 | 0.685 | 0.671 | 0.719 | 240.0% | 0.643 | 0.010 |
| Chain Of Thought | 4,294 | 0.690 | 0.688 | 0.685 | 262.5% | 0.663 | 0.010 |
| Perspective | 4,332 | 0.693 | 0.687 | 0.659 | 503.3% | 0.650 | 0.010 |
| Persona Fairness | 4,230 | 0.683 | 0.679 | 0.657 | 611.1% | 0.643 | 0.010 |
| Roleplay | 4,345 | 0.674 | 0.667 | 0.632 | 613.3% | 0.638 | 0.010 |
| Structured Extraction | 4,335 | 0.684 | 0.684 | 0.666 | 7900.0% | 0.650 | 0.010 |
| Minimal | 4,207 | 0.671 | 0.670 | 0.602 | 9600.0% | 0.635 | 0.010 |
Hypothesis: H0: All bias mitigation methods are just as effective (or ineffective) as one another
Model: Linear Mixed-Effects Model (subject-specific interpretation) - Model: bias ~ mitigation + persona [+ mitigation:persona] + (1 | case_id)
Test: Likelihood-ratio test comparing models with vs without the mitigation term (approximated by repeated-measures ANOVA)
Test Statistic: F = 0.905
p-value: 0.490149
Effect Size (η²): 0.000776 (negligible)
Conclusion: The null hypothesis was not rejected (p = 0.490)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that bias mitigation strategies differ in effectiveness.
Note: Analysis based on Linear Mixed-Effects Model with case_id as random effect. Full implementation would use specialized mixed-effects libraries.
| Condition | Total Cases | Questions Asked | Question Rate |
|---|---|---|---|
| Baseline (No Mitigation) | 1,000 | 6 | 0.0060 |
| Mitigation (All Strategies) | 30,000 | 578 | 0.0193 |
| Chain Of Thought | 4,164 | 91 | 0.0219 |
| Consequentialist | 4,352 | 90 | 0.0207 |
| Minimal | 4,226 | 66 | 0.0156 |
| Persona Fairness | 4,289 | 72 | 0.0168 |
| Perspective | 4,346 | 56 | 0.0129 |
| Roleplay | 4,274 | 99 | 0.0232 |
| Structured Extraction | 4,349 | 104 | 0.0239 |
Hypothesis: H0: Question rates are the same with and without bias mitigation
Test: Chi-squared test on question counts
Effect Size (Cramér's V): 0.017 (negligible)
Test Statistic: χ² = 8.511
p-Value: 0.004
Conclusion: The null hypothesis was rejected (p < 0.05)
Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).
Implication: Bias mitigation raises the question rate from 0.0060 to 0.0193, an absolute increase of 0.0133 (1.33 percentage points) and a relative increase of 221.1%.
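The baseline-versus-mitigation comparison above reduces to a 2x2 table (question asked or not, by condition). The sketch below recomputes it in pure Python with a Yates continuity correction. Using the correction is an assumption, chosen because it is the common default for 2x2 tables, and the function name `yates_chi2_2x2` is illustrative.

```python
from math import erfc, sqrt


def yates_chi2_2x2(a, b, c, d):
    """Yates-corrected chi-squared for the table [[a, b], [c, d]].

    Returns (chi2, p_value, cramers_v). The p-value uses the closed form for
    one degree of freedom: P(X > x) = erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(((a, b), (c, d))):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Continuity correction: shrink each deviation by 0.5, never below zero.
            chi2 += max(abs(obs - expected) - 0.5, 0.0) ** 2 / expected
    return chi2, erfc(sqrt(chi2 / 2)), sqrt(chi2 / n)


# Counts copied from the table above: questions asked vs not asked, per condition.
baseline_q, baseline_n = 6, 1000
mitigation_q, mitigation_n = 578, 30000

chi2, p_value, v = yates_chi2_2x2(
    baseline_q, baseline_n - baseline_q,
    mitigation_q, mitigation_n - mitigation_q,
)
baseline_rate = baseline_q / baseline_n
mitigation_rate = mitigation_q / mitigation_n
relative_increase = (mitigation_rate - baseline_rate) / baseline_rate

print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, Cramér's V = {v:.3f}")
print(f"relative increase = {relative_increase:.1%}")
```

Note that the baseline arm has only 6 events out of 1,000 cases, so the relative increase is sensitive to a handful of responses even though the test is significant.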
| Condition | Total Cases | Questions Asked | Question Rate |
|---|---|---|---|
| Baseline (No Mitigation) | 1,000 | 1 | 0.0010 |
| Mitigation (All Strategies) | 30,000 | 125 | 0.0042 |
| Chain Of Thought | 4,294 | 23 | 0.0054 |
| Consequentialist | 4,257 | 14 | 0.0033 |
| Minimal | 4,207 | 21 | 0.0050 |
| Persona Fairness | 4,230 | 16 | 0.0038 |
| Perspective | 4,332 | 9 | 0.0021 |
| Roleplay | 4,345 | 17 | 0.0039 |
| Structured Extraction | 4,335 | 25 | 0.0058 |
Hypothesis: H0: Question rates are the same with and without bias mitigation
Test: Chi-squared test on question counts
Effect Size (Cramér's V): 0.007 (negligible)
Test Statistic: χ² = 1.679
p-Value: 0.195
Conclusion: The null hypothesis was not rejected (p ≥ 0.05)
Practical Significance: This result is not statistically significant (effect size: negligible).
Implication: There is no evidence that bias mitigation affects question rates
| Ground Truth \ LLM | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Tier 0 | 473 | 307 | 28 |
| Tier 1 | 77 | 46 | 2 |
| Tier 2 | 16 | 36 | 15 |
| Decision Method | Experiment Category | Sample Size | Correct | Accuracy % |
|---|---|---|---|---|
| n-shot | Baseline | 1,000 | 478 | 48% |
| n-shot | Bias Mitigation | 30,000 | 14,121 | 47% |
| n-shot | Persona-Injected | 10,000 | 4,543 | 45% |
| zero-shot | Baseline | 1,000 | 534 | 53% |
| zero-shot | Bias Mitigation | 30,000 | 15,871 | 53% |
| zero-shot | Persona-Injected | 10,000 | 5,156 | 52% |
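Overall accuracy hides how performance varies by tier, so it can help to derive per-tier recall from the confusion matrix shown above. The sketch below does this in plain Python. The matrix's diagonal sums to 534 of 1,000 cases, which coincides with the zero-shot Baseline row of the accuracy table, though the report does not label the matrix explicitly. The variable names are illustrative.

```python
# Rows are ground-truth tiers, columns are LLM-assigned tiers (Tier 0, 1, 2),
# copied from the confusion matrix above.
confusion = [
    [473, 307, 28],
    [77, 46, 2],
    [16, 36, 15],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Recall per tier: share of cases with a given ground-truth tier that the LLM matched.
recall = [confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))]

print(f"accuracy = {accuracy:.3f} ({correct} of {total})")
for tier, value in enumerate(recall):
    print(f"Tier {tier} recall = {value:.3f}")
```

Per-tier recall shows that accuracy on Tier 1 and Tier 2 cases is well below the headline figure, which matters because those are the cases where remedies are at stake.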
Financial Services Impact
The statistical findings indicate a significant bias in the bank's LLM complaint handling system, particularly in how complaints of differing severity are treated. With p-values far below the standard 0.05 threshold and large Cohen's d effect sizes (2.239 and 1.994), this bias is not only statistically significant but also practically meaningful. This suggests that the system's responses to complaints may disproportionately disadvantage or advantage certain demographic groups, especially in more severe (monetary) cases. Such disparities could lead to regulatory scrutiny under fair lending laws such as the Equal Credit Opportunity Act, which prohibit discrimination on protected characteristics. Additionally, the Consumer Financial Protection Bureau (CFPB) has been increasingly focused on the fairness of AI systems in financial services, making this finding a potential trigger for enforcement actions, including fines and mandates for corrective measures.
Customer Experience Implications
Although the specific demographic groups affected by this bias are not detailed in the provided findings, the presence of persona injection bias suggests that certain characteristics (e.g., age, gender, race, or income level) could influence the system's severity assessment and subsequent handling of complaints. This could lead to a perception of unfair treatment among those negatively impacted, eroding trust not only in the bank's complaint resolution process but also in its broader commitment to equitable customer service. For groups that are systematically disadvantaged by this bias, the impact could range from receiving inadequate remedies for serious complaints to feeling alienated from seeking redress through official channels.
Recommended Actions
Monitoring Strategy
Given that this bias was identified through zero-shot testing, a continuous monitoring approach should be adopted that covers both zero-shot and n-shot evaluations. This would involve periodically re-running the fairness tests with zero-shot prompts (no worked examples in the prompt) to confirm that the initial biases are not being perpetuated. N-shot evaluations, where the model is given a few examples before making its decision, should be tracked alongside them, because the results above show that bias patterns differ between the two methods. Specific attention should be paid to the Tier Impact Rate, tracking any shifts in how different severity levels of complaints are handled across demographic groups. This targeted monitoring should be complemented by regular audits of the outcomes of severe complaints, ensuring that any emerging biases are identified and addressed promptly.